Document ranking using web evidence
Abstract
Evidence based on web graph structure is reportedly used by the current generation of World-Wide Web (WWW) search engines to identify “high-quality”, “important” pages and to reject “spam” content. However, despite the apparent wide use of this evidence, its application in web-based document retrieval is controversial. Confusion exists as to how to incorporate web evidence in document ranking, and whether such evidence is in fact useful. This thesis demonstrates how web evidence can be used to improve retrieval effectiveness for navigational search tasks. Fundamental questions investigated include: which forms of web evidence are useful, how web evidence should be combined with other document evidence, and what biases are present in web evidence. Through investigating these questions, this thesis presents a number of findings regarding how web evidence may be effectively used in a general-purpose web-based document ranking algorithm. The results of experimentation with well-known forms of web evidence on several small-to-medium collections of web data are surprising. Aggregate anchor-text measures perform well, but well-studied hyperlink recommendation algorithms are far less useful. Further gains in retrieval effectiveness are achieved for anchor-text measures by revising traditional full-text ranking methods to favour aggregate anchor-text documents containing large volumes of anchor-text. For home page finding tasks, additional gains are achieved by including a simple URL depth measure which favours short URLs over long ones. The most effective combination of evidence treats document-level and web-based evidence as separate document components, and uses a linear combination to sum scores. It is submitted that the document-level evidence contains the author’s description of document contents, and that the web-based evidence gives the wider web community view of the document. Consequently, if both measures agree, and the document is scored highly in both cases, this is a strong indication that the page is what it claims to be. A linear combination of the two types of evidence is found to be particularly effective, achieving the highest retrieval effectiveness of any query-dependent evidence on navigational and Topic Distillation tasks. However, care should be taken when using hyperlink-based evidence as a direct measure of document quality. Thesis experiments show the existence of bias towards the home pages of large, popular and technology-oriented companies. Further empirical evidence is presented to demonstrate how the authorship of web documents and sites directly affects the quantity and quality of available web evidence. These factors demonstrate the need for robust methods for mining and interpreting data from the web graph.
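The combination described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the mixing weight, the depth penalty, and the choice to count URL path segments as "depth" are all assumptions made for illustration, and the abstract does not state how the URL depth measure enters the final score.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count non-empty path segments; a site root has depth 0 (assumed definition)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def combined_score(content_score: float, anchor_score: float, url: str,
                   weight: float = 0.5, depth_penalty: float = 0.1) -> float:
    """Linearly combine document-level and anchor-text scores.

    `weight` and `depth_penalty` are illustrative values, not tuned ones.
    Shorter URLs are favoured by subtracting a penalty per path segment.
    """
    linear = weight * content_score + (1.0 - weight) * anchor_score
    return linear - depth_penalty * url_depth(url)


def rank(candidates):
    """Sort candidate dicts (keys: url, content, anchor) by combined score, best first."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c["content"], c["anchor"], c["url"]),
        reverse=True,
    )


if __name__ == "__main__":
    pages = [
        {"url": "http://example.com/a/b/page.html", "content": 0.9, "anchor": 0.6},
        {"url": "http://example.com/", "content": 0.7, "anchor": 0.7},
    ]
    for page in rank(pages):
        print(page["url"])
```

In this toy input the deeper page has the higher raw linear score, but the short home page URL ranks first once the depth penalty is applied, which mirrors the home page finding behaviour the abstract describes.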
Similar resources
RRLUFF: Ranking function based on Reinforcement Learning using User Feedback and Web Document Features
The principal aim of a search engine is to provide sorted results according to the user’s requirements. To achieve this aim, it employs ranking methods to rank web documents based on their significance and relevance to the user’s query. The novelty of this paper is a user feedback-based ranking algorithm using reinforcement learning. The proposed algorithm is called RRLUFF, in which the rank...
An Ensemble Click Model for Web Document Ranking
Every year, web search engine providers spend more and more money on ranking documents in search engine result pages (SERPs). Click models provide useful information for ranking documents in SERPs by modeling interactions between users and search engines. Here, three modules are employed to create a hybrid click model; the first module is a PGM-based click model, the second module in a d...
Web pages ranking algorithm based on reinforcement learning and user feedback
The main challenge of a search engine is ranking web documents to provide the best response to a user's query. Despite the huge number of results extracted for a user's query, only a small number of the first results are examined by users; therefore, placing the related results in the top ranks is of great importance. In this paper, a ranking algorithm based on the reinforcement le...
Investigating the Impact of Authors’ Rank in Bibliographic Networks on Expertise Retrieval
Background and Aim: This research investigates the impact of authors’ rank in bibliographic networks on the document-centered model of expertise retrieval. Its purpose is to find out what kind of author ranking in bibliographic networks can improve the performance of the document-centered model. Methodology: The current research is experimental. To operationalize the research goals, a new test colle...
A New Model for Phrase Search Based on Weighted Minimum Displacement
Finding high-quality web pages is one of the most important tasks of search engines. The relevance of the documents found to the query searched depends on the user's observation, which increases the complexity of ranking algorithms. Another issue is that users often explore just the first 10 to 20 results, while millions of pages related to a query may exist. So search engines have to use sui...
Publication date: 2005